[DataFrame] Implement Indexer getitem#1955
[DataFrame] Implement Indexer getitem#1955simon-mo wants to merge 2 commits intoray-project:masterfrom
Conversation
|
Test PASSed. |
devin-petersohn
left a comment
There was a problem hiding this comment.
This is really great! I left some comments and questions.
| Metadata for the new dataframe's columns | ||
| partial (boolean): | ||
| Internal: row_metadata and col_metadata only covers part of the | ||
| block partitions. (Used in index 'vew' accessor) |
| copy=False, col_partitions=None, row_partitions=None, | ||
| block_partitions=None, row_metadata=None, col_metadata=None): | ||
| block_partitions=None, row_metadata=None, col_metadata=None, | ||
| partial=False): |
There was a problem hiding this comment.
Would it make more sense to have a DataFrameView object as a subclass of this one?
There was a problem hiding this comment.
As I'm reading through this, I think it might make more sense to have it as a subclass. Changing the way that _block_partitions is handled in the case that it's a view makes complicated code more complicated.
Alternatively, if we decide otherwise, I would still like to see more comments about the _block_partitions changes.
There was a problem hiding this comment.
I agree we should have it as a subclass. DataFrameView object make sense.
I'm refactoring parts of my indexing.py to make enlargement inside the parent class of the _LocIndexer and _iLocIndexer. So that we have cleaner code in indexing.py
| if block_partitions is not None: | ||
| # put in numpy array here to make accesses easier since it's 2D | ||
| self._block_partitions = np.array(block_partitions) | ||
| if row_metadata is not None: |
There was a problem hiding this comment.
Why do we need this in two places?
| Returns: | ||
| The dtypes for this DataFrame. | ||
| """ | ||
| # Deal with empty column case |
There was a problem hiding this comment.
Prefer comment to say Deal with a DataFrame with no columns or something along those lines. Empty columns feels a bit ambiguous (could mean all NaN in some cases).
| Internal: row_metadata and col_metadata only covers part of the | ||
| block partitions. (Used in index 'vew' accessor) | ||
| """ | ||
| self.partial = partial |
| if is_2d(row_loc) and is_2d(col_loc): | ||
| return self._get_dataframe_view(row_loc, col_loc) | ||
|
|
||
| def _get_scaler(self, row_loc, col_loc): |
There was a problem hiding this comment.
Are you planning to implement these? It would be better to have a NotImplementedError than just pass. That way we don't have some silent internal error or something.
| axis = 0 | ||
| columns = pd_df.columns | ||
| index = pd_df.index | ||
| self._row_metadata = self._col_metadata = None |
There was a problem hiding this comment.
Duplicated (Remove this in favor of Line 73)
| retrieved_rows_remote = self._map_partition( | ||
| lookup_dict, col_label, indexer='loc') | ||
| joined_df = pd.concat(ray.get(retrieved_rows_remote)) | ||
| def _get_scaler(self, row_loc, col_loc): |
There was a problem hiding this comment.
Would you be able to add some comments about what these methods do (this and methods below)? Just so that we can quickly look at the code and maintain it better.
| # The returned result need to be indexed series/df | ||
| # Re-index is needed. | ||
| joined_df.index = index_loc.index | ||
| def _get_scaler(self, row_loc, col_loc): |
There was a problem hiding this comment.
Same on this file about method level documentation.
|
@devin-petersohn I'll make the following change in the |
|
Closed via #2020 |

What do these changes do?
locandiloc's getitem methodsNote